DETAILED DERIVATIONS OF SMALL-VARIANCE
ASYMPTOTICS FOR SOME HIERARCHICAL BAYESIAN 
NONPARAMETRIC MODELS 

JONATHAN H. HUGGINS, ARDAVAN SAEEDI, AND MATTHEW J. JOHNSON 


Abstract. In this note we provide detailed derivations of two versions of 
small-variance asymptotics for hierarchical Dirichlet process (HDP) mixture 
models and the HDP hidden Markov model (HDP-HMM, a.k.a. the infinite 
HMM). We include derivations for the probabilities of certain CRP and CRF 
partitions, which are of more general interest. 


1. Introduction 


Numerous flexible Bayesian nonparametric models and associated inference
algorithms have been developed in recent years for solving problems such as
clustering and time series analysis. However, simpler approaches such as
$k$-means remain extremely popular due to their simplicity and scalability to
the large-data setting.

The $k$-means optimization problem can be viewed as the small-variance limit of
MAP inference in a $K$-component Gaussian mixture model. That is, with observed
data $X = (x_n)_{n=1}^{N}$, $x_n \in \mathbb{R}^D$, the Gaussian mixture model
log joint density with means $\mu_1, \dots, \mu_K \in \mathbb{R}^D$, cluster
assignments $Z = (z_n)_{n=1}^{N}$ with $z_n \in \{1, 2, \dots, K\}$, and
spherical variance $\sigma^2$ is


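\begin{align}
\log p(\mu, Z, X) &= \log p(\mu)\,p(Z) - \tfrac{1}{2} N D \log 2\pi\sigma^2
    - \sum_{n=1}^{N} \frac{\|x_n - \mu_{z_n}\|^2}{2\sigma^2} \notag \\
&= -\beta \sum_{n=1}^{N} \|x_n - \mu_{z_n}\|^2 + o(\beta), \tag{1}
\end{align}

where $\beta = \frac{1}{2\sigma^2}$. As $\sigma^2 \to 0$, or equivalently
$\beta \to \infty$, the term that is linear in $\beta$ dominates and the MAP
problem becomes the $k$-means problem in the sense that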


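\begin{align}
\lim_{\sigma^2 \to 0} \arg\max_{Z,\mu} \log p(\mu, Z, X)
&= \lim_{\sigma^2 \to 0} \arg\min_{Z,\mu}
   \Big\{ \beta \sum_{n=1}^{N} \|x_n - \mu_{z_n}\|^2 + o(\beta) \Big\} \notag \\
&= \arg\min_{Z,\mu} \sum_{n=1}^{N} \|x_n - \mu_{z_n}\|^2. \tag{2}
\end{align}

Note that we have assumed the priors $p(Z)$ and $p(\mu)$ are positive and
independent of $\sigma^2$.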
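The limit in Eqs. (1) and (2) can be illustrated numerically. The following
Python sketch evaluates the log joint density divided by $\beta$ for shrinking
$\sigma^2$ and compares it with the negative $k$-means cost. The toy data, the
function names `gmm_log_joint` and `kmeans_cost`, and the flat prior
(`log_prior = 0`) are assumptions made only for this illustration.

```python
import math

def kmeans_cost(X, Z, mu):
    """Sum of squared distances from each point to its assigned mean."""
    return sum(
        sum((x - m) ** 2 for x, m in zip(X[n], mu[Z[n]]))
        for n in range(len(X))
    )

def gmm_log_joint(X, Z, mu, sigma2, log_prior=0.0):
    """Log joint of Eq. (1); log_prior stands in for log p(mu) p(Z),
    assumed independent of sigma2 (a flat prior here, for illustration)."""
    N, D = len(X), len(X[0])
    return (log_prior
            - 0.5 * N * D * math.log(2 * math.pi * sigma2)
            - kmeans_cost(X, Z, mu) / (2 * sigma2))

# Hypothetical toy data: three points in the plane, two clusters.
X = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
mu = [(0.5, 0.0), (5.0, 4.0)]
Z = [0, 0, 1]

for sigma2 in (1.0, 1e-2, 1e-4, 1e-6):
    beta = 1.0 / (2 * sigma2)
    # The first number approaches the second as sigma2 -> 0.
    print(sigma2, gmm_log_joint(X, Z, mu, sigma2) / beta, -kmeans_cost(X, Z, mu))
```

The normalization term grows only like $\log \beta$, which is why it vanishes
after dividing by $\beta$.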

Recently developed small-variance asymptotics (SVA) methods generalize
the above derivation of $k$-means to other Bayesian models, with nonparametric
Bayesian models such as those based on the Dirichlet process (DP) and the
Indian buffet process being of particular interest. While obtaining $k$-means
from the Gaussian mixture model is straightforward, the SVA derivations for
nonparametric models can be quite subtle, especially for hierarchical models.
Indeed, we are not aware of a reference with the derivations for many important
DP and hierarchical DP (HDP) probability expressions. This note is meant to
serve as a self-contained reference to some DP and HDP material of general
interest as well as SVA derivations for HDP-based models. In particular, we
provide derivations for the HDP mixture model and the HDP-HMM (a.k.a. the
iHMM).

A Caution. This note is not meant to serve as an introduction to SVA methods
or Bayesian nonparametric modeling tools. Thus, we assume the reader is familiar
with the MAD-Bayes approach to SVA and scaled exponential family
distributions. We also assume basic familiarity with the Chinese restaurant
process (CRP) and Chinese restaurant franchise (CRF) representations of,
respectively, the DP and the HDP.


2. Preliminaries 

2.1. Notation. For an arbitrary real-valued vector $v \in \mathbb{R}^D$ and
indices $1 \le i \le j \le D$, let $v_{i:j} = (v_i, v_{i+1}, \dots, v_j)$,
$v_{\cdot j} = \sum_{i=1}^{j} v_i$, and $v_{\cdot} = v_{\cdot D}$; we extend the
range and dot notations in the obvious way to matrices and tensors. By
convention $\prod_{\emptyset} = 1$ and $\sum_{\emptyset} = 0$.


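2.2. Dirichlet process, Chinese restaurants, and all that. Recall that for
a CRP with concentration parameter $\kappa$, given that there are $c$
observations, the probability of observing $L$ tables with counts
$c_1, \dots, c_L$ is:

\begin{align}
P_{\mathrm{CRP}}(c_{1:L} \mid \kappa)
&= \frac{\prod_{l=1}^{L} \kappa\,(c_l - 1)!}{\prod_{i=0}^{c-1} (\kappa + i)} \tag{3} \\
&= \kappa^{L}\, \frac{\Gamma(\kappa)}{\Gamma(\kappa + c)} \prod_{l=1}^{L} (c_l - 1)! \tag{4} \\
&= \kappa^{L-1}\, \frac{\Gamma(\kappa + 1)}{\Gamma(\kappa + c)} \prod_{l=1}^{L} (c_l - 1)!, \tag{5}
\end{align}

where we have used the exchangeability of the CRP.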
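As a sanity check on the closed form
$\kappa^{L-1}\,\Gamma(\kappa+1)/\Gamma(\kappa+c)\,\prod_l (c_l - 1)!$, the
following Python sketch compares it with the probability obtained by seating
customers one at a time under the usual CRP rule (new table with probability
proportional to $\kappa$, existing table with probability proportional to its
size). The function names and the example seating are hypothetical and chosen
only for illustration.

```python
import math

def crp_log_prob(counts, kappa):
    """Closed-form log probability of a seating arrangement with the given
    table counts: kappa^(L-1) * Gamma(kappa+1)/Gamma(kappa+c) * prod (c_l-1)!"""
    L, c = len(counts), sum(counts)
    return ((L - 1) * math.log(kappa)
            + math.lgamma(kappa + 1) - math.lgamma(kappa + c)
            + sum(math.lgamma(cl) for cl in counts))  # lgamma(n) = log (n-1)!

def crp_log_prob_sequential(tables, kappa):
    """Seat customers in order; tables[n] is the table label of customer n."""
    sizes, logp = {}, 0.0
    for n, t in enumerate(tables):
        if t in sizes:
            logp += math.log(sizes[t] / (kappa + n))   # join an existing table
        else:
            logp += math.log(kappa / (kappa + n))      # open a new table
        sizes[t] = sizes.get(t, 0) + 1
    return logp

kappa = 0.7
# Two orderings of the same partition (table counts 3 and 2).
print(crp_log_prob([3, 2], kappa))
print(crp_log_prob_sequential([0, 1, 0, 0, 1], kappa))
print(crp_log_prob_sequential([0, 0, 0, 1, 1], kappa))
```

All three printed values agree, which reflects the exchangeability used in the
derivation: the probability depends on the table counts but not on the order in
which customers arrive.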

For a CRF with concentration parameters $\kappa$ and $\alpha$ and $c$
observations, let $K$ be the number of tables in the top-level restaurant, let
$N$ be the number of franchises, let $t_{ij}$ be the number of tables in
restaurant franchise $i$ serving dish $j$, and let $c_{ijt}$ be the number of
customers at the $t$-th table serving dish $j$ in restaurant $i$. For the
top-level restaurant, the “customer” counts are $(t_{\cdot j})_{j=1}^{K}$ and
for the $i$-th franchise, the customer counts are $(c_{ijt})_{j,t}$. Hence,
repeatedly using Eq. (5), we have

\[
P_{\mathrm{CRF}}(t, c \mid \alpha, \kappa)
= \kappa^{K-1}\, \frac{\Gamma(\kappa + 1)}{\Gamma(\kappa + t_{\cdot\cdot})}
  \prod_{j=1}^{K} (t_{\cdot j} - 1)!
  \prod_{i=1}^{N} \left[ \alpha^{t_{i\cdot} - 1}\,
  \frac{\Gamma(\alpha + 1)}{\Gamma(\alpha + c_{i\cdot\cdot})}
  \prod_{j=1}^{K} \prod_{t=1}^{t_{ij}} (c_{ijt} - 1)! \right].
\tag{6}
\]


We can integrate over all possible seating arrangements for the customers in the
franchises. That is, if $C_{ij} = c_{ij\cdot}$ is the number of customers eating
dish $j$ at restaurant $i$, then

\[
P_{\mathrm{CRF}}(t, C \mid \alpha, \kappa)
= \kappa^{K-1}\, \frac{\Gamma(\kappa + 1)}{\Gamma(\kappa + t_{\cdot\cdot})}
  \prod_{j=1}^{K} (t_{\cdot j} - 1)!
  \prod_{i=1}^{N} \left[ \alpha^{t_{i\cdot} - 1}\,
  \frac{\Gamma(\alpha + 1)}{\Gamma(\alpha + C_{i\cdot})}
  \prod_{j=1}^{K} \genfrac{[}{]}{0pt}{}{C_{ij}}{t_{ij}} \right],
\tag{7}
\]

where $\genfrac{[}{]}{0pt}{}{C_{ij}}{t_{ij}}$ is an unsigned Stirling number of
the first kind, which can be interpreted as the number of ways to seat $C_{ij}$
customers at $t_{ij}$ tables such that each table has at least one customer
(more formally, the number of permutations of $C_{ij}$ objects with $t_{ij}$
disjoint cycles).
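The step from per-table factors $(c_{ijt} - 1)!$ to unsigned Stirling numbers
of the first kind rests on a combinatorial identity: summing
$\prod_t (c_t - 1)!$ over all ways of splitting $n$ labelled customers into $k$
non-empty tables gives the number of permutations of $n$ objects with $k$
cycles. The Python sketch below checks this by brute force for small $n$. The
function names are hypothetical; the recurrence
$c(n,k) = (n-1)\,c(n-1,k) + c(n-1,k-1)$ is the standard one for these numbers.

```python
from functools import lru_cache
from math import factorial

@lru_cache(maxsize=None)
def stirling1_unsigned(n, k):
    """Unsigned Stirling number of the first kind via the standard recurrence."""
    if n == 0 and k == 0:
        return 1
    if n == 0 or k == 0:
        return 0
    return (n - 1) * stirling1_unsigned(n - 1, k) + stirling1_unsigned(n - 1, k - 1)

def seating_weight_sum(n, k):
    """Brute force: enumerate every partition of n labelled customers into
    exactly k non-empty tables and sum prod_t (c_t - 1)! over them."""
    total = 0

    def rec(i, sizes):
        nonlocal total
        if i == n:
            if len(sizes) == k:
                w = 1
                for s in sizes:
                    w *= factorial(s - 1)
                total += w
            return
        for t in range(len(sizes)):      # customer i joins an existing table
            sizes[t] += 1
            rec(i + 1, sizes)
            sizes[t] -= 1
        if len(sizes) < k:               # customer i opens a new table
            sizes.append(1)
            rec(i + 1, sizes)
            sizes.pop()

    rec(0, [])
    return total

for n in range(1, 6):
    print(n, [(stirling1_unsigned(n, k), seating_weight_sum(n, k))
              for k in range(1, n + 1)])
```

Each printed pair contains two equal numbers, one from the recurrence and one
from the enumeration.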


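3. SVA for HDP Mixture Models

The generative model for the HDP with $N$ groups and $J_i$ Gaussian
observations in group $i$ is:

\begin{align}
\beta &\sim \mathrm{GEM}(\gamma) \tag{8} \\
\pi_i &\sim \mathrm{DP}(\alpha\beta), & i &\ge 1 \tag{9} \\
\mu_k &\sim \mathcal{N}(0, \sigma_0^2 I), & k &\ge 1 \tag{10} \\
z_{ij} \mid \pi_i &\sim \mathrm{Multi}(\pi_i), & j, i &\ge 1 \tag{11} \\
y_{ij} \mid z_{ij} &\sim \mathcal{N}(\mu_{z_{ij}}, \sigma^2 I), & j, i &\ge 1. \tag{12}
\end{align}

Here $\mathrm{GEM}(\gamma)$ is the stick-breaking prior with concentration
parameter $\gamma$ and $y_{ij}$ is the $j$-th observation in the $i$-th group.
Let $Z_i = (z_{ij})_{j=1}^{J_i}$, $Y_i = (y_{ij})_{j=1}^{J_i}$, and
$K = \max_{i,j} z_{ij}$. For the joint density of the HDP we have: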

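\begin{align}
&P(Z_{1:N}, Y_{1:N}, \mu_{1:K}, \pi_{1:N,1:K}, \beta_{1:K} \mid \sigma_0, \sigma, \alpha, \gamma) \notag \\
&\quad = F(Z_{1:N}, \beta_{1:K}, \pi_{1:N,1:K} \mid \alpha, \gamma)
  \prod_{k=1}^{K} \prod_{ij : z_{ij} = k} \mathcal{N}(y_{ij} \mid \mu_k, \sigma^2 I)
  \prod_{k=1}^{K} \mathcal{N}(\mu_k \mid 0, \sigma_0^2 I), \tag{13}
\end{align}

where

\begin{align}
&F(Z_{1:N}, \beta_{1:K}, \pi_{1:N,1:K} \mid \alpha, \gamma) \notag \\
&\quad = \mathrm{GEM}(\beta_{1:K}; \gamma)
  \prod_{i=1}^{N} \mathrm{DP}(\pi_{i,1:K}; \alpha\beta_{1:K})
  \prod_{i=1}^{N} \prod_{j=1}^{J_i} \mathrm{Multi}(z_{ij}; \pi_{i,1:K}). \tag{14}
\end{align}

Integrating out $\beta_{1:K}$ and $\pi_{1:N,1:K}$, we obtain the CRF
representation in Eq. (7):

\[
F(t, Z_{1:N} \mid \alpha, \gamma)
= \gamma^{K-1}\, \frac{\Gamma(\gamma + 1)}{\Gamma(\gamma + t_{\cdot\cdot})}
  \prod_{j=1}^{K} (t_{\cdot j} - 1)!
  \prod_{i=1}^{N} \left[ \alpha^{t_{i\cdot} - 1}\,
  \frac{\Gamma(\alpha + 1)}{\Gamma(\alpha + J_i)}
  \prod_{j=1}^{K} \genfrac{[}{]}{0pt}{}{C_{ij}}{t_{ij}} \right],
\tag{15}
\]

where $C_{ij} = \#\{j' : z_{ij'} = j\}$ is the number of observations in group
$i$ assigned to component $j$, so that $C_{i\cdot} = J_i$.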


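As in the introduction, define $\beta = \frac{1}{2\sigma^2}$, so we are
interested in the limit $\beta \to \infty$ (i.e., $\sigma^2 \to 0$). To maintain
the effects of the hyperparameters in the small-variance limit, we set
$\gamma = \exp(-\lambda_1 \beta)$ and $\alpha = \exp(-\lambda_2 \beta)$, where
$\lambda_1 > 0$ and $\lambda_2 > 0$ are free parameters. Taking the logarithm of
Eq. (15), we obtain

\begin{align}
\log F(t, Z_{1:N} \mid \alpha, \gamma)
&= (K - 1) \log \gamma
   + \log \frac{\Gamma(\gamma + 1)}{\Gamma(\gamma + t_{\cdot\cdot})}
   + \sum_{j=1}^{K} \log (t_{\cdot j} - 1)! \notag \\
&\quad + \sum_{i=1}^{N} \left\{ (t_{i\cdot} - 1) \log \alpha
   + \log \frac{\Gamma(\alpha + 1)}{\Gamma(\alpha + J_i)}
   + \sum_{j=1}^{K} \log \genfrac{[}{]}{0pt}{}{C_{ij}}{t_{ij}} \right\} \tag{16} \\
&= -\beta\lambda_1 (K - 1) + O(1)
   - \sum_{i=1}^{N} \left\{ \beta\lambda_2 (t_{i\cdot} - 1) + O(1) \right\}. \tag{17}
\end{align}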
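The leading-order behaviour in Eq. (17) can be checked numerically. The Python
sketch below evaluates the CRF log probability with
$\gamma = \exp(-\lambda_1\beta)$ and $\alpha = \exp(-\lambda_2\beta)$, divides
by $\beta$, and compares the result with
$-\lambda_1 (K-1) - \lambda_2 \sum_i (t_{i\cdot} - 1)$. It assumes the
top-level factorial terms have the form $(t_{\cdot j} - 1)!$, as in the CRP
partition probability; the function names and the small table/customer
configuration are hypothetical.

```python
import math

def stirling1_unsigned(n, k):
    """Unsigned Stirling number of the first kind, built row by row."""
    row = [1]                                   # n = 0: c(0, 0) = 1
    for m in range(1, n + 1):
        new = [0] * (m + 1)
        for j in range(1, m + 1):
            prev = row[j] if j < len(row) else 0
            new[j] = (m - 1) * prev + row[j - 1]
        row = new
    return row[k] if 0 <= k <= n else 0

def log_F(t, C, alpha, gamma):
    """CRF log probability: t[i][j] tables and C[i][j] customers in
    restaurant i serving dish j (every dish must be served somewhere)."""
    N, K = len(t), len(t[0])
    t_dish = [sum(t[i][j] for i in range(N)) for j in range(K)]
    out = ((K - 1) * math.log(gamma)
           + math.lgamma(gamma + 1) - math.lgamma(gamma + sum(t_dish))
           + sum(math.lgamma(tj) for tj in t_dish))      # log (t_.j - 1)!
    for i in range(N):
        out += ((sum(t[i]) - 1) * math.log(alpha)
                + math.lgamma(alpha + 1) - math.lgamma(alpha + sum(C[i]))
                + sum(math.log(stirling1_unsigned(C[i][j], t[i][j]))
                      for j in range(K)))
    return out

# Hypothetical configuration: N = 2 restaurants, K = 2 dishes.
t = [[1, 1], [2, 0]]
C = [[3, 2], [4, 0]]
lam1, lam2 = 1.0, 0.5
limit = -lam1 * (len(t[0]) - 1) - lam2 * sum(sum(ti) - 1 for ti in t)

for beta in (10.0, 100.0, 500.0):
    val = log_F(t, C, math.exp(-lam2 * beta), math.exp(-lam1 * beta))
    print(beta, val / beta, limit)   # ratio approaches the limit
```

The values of $\beta$ are kept moderate because $\exp(-\lambda\beta)$
underflows to zero in double precision for very large $\beta$.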


If $m_i = \#\{z_{ij}\}_{j=1}^{J_i}$ is the number of distinct indices in $Z_i$,
then

\[
\max_{t \,:\, t \sim Z_{1:N}}
  \Big\{ -\lambda_1 (K - 1) - \lambda_2 \sum_{i=1}^{N} (t_{i\cdot} - 1) \Big\}
= -\lambda_1 (K - 1) - \lambda_2 \sum_{i=1}^{N} (m_i - 1),
\tag{18}
\]

where $t \sim Z_{1:N}$ denotes that $t$ is consistent with $Z_{1:N}$. The
maximum is attained when every dish served in restaurant $i$ occupies exactly
one table, so that $t_{i\cdot} = m_i$. Hence, after setting the variance of
$\mu_k$ to be $\sigma_0^2 = \sigma^2 / \lambda_3$, where $\lambda_3 > 0$ is a
free parameter, the SVA objective function for the HDP mixture model is

\[
\min_{K, Z, \mu} \left\{
  \sum_{k=1}^{K} \sum_{ij : z_{ij} = k} \|y_{ij} - \mu_k\|^2
  + \lambda_1 (K - 1)
  + \lambda_2 \sum_{i=1}^{N} (m_i - 1)
  + \lambda_3 \sum_{k=1}^{K} \|\mu_k\|^2 \right\}.
\tag{19}
\]

This cost function is in fact the $k$-means objective function with some
additional penalty terms. The second and third terms in Eq. (19) penalize the
number of global and local clusters, respectively. The final term introduces the
additional cost for the prior over the cluster means.
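To make the four terms of the HDP mixture objective concrete, the following
Python sketch evaluates the cost for a fixed assignment. It only evaluates the
objective and does not optimize it; the function name `hdp_sva_cost`, the
zero-based cluster indices, and the toy data are assumptions for illustration.

```python
def hdp_sva_cost(Y, Z, mu, lam1, lam2, lam3):
    """Evaluate the SVA objective for the HDP mixture model.

    Y[i][j] is observation j of group i (a tuple of floats), Z[i][j] is its
    cluster index in range(K), and mu[k] is the mean of cluster k.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    K = len(mu)
    # k-means term: squared distance of every observation to its mean.
    fit = sum(sq_dist(y, mu[z]) for Yi, Zi in zip(Y, Z) for y, z in zip(Yi, Zi))
    # Local penalty: m_i is the number of distinct clusters used in group i.
    local = sum(len(set(Zi)) - 1 for Zi in Z)
    # Prior term: squared norms of the cluster means.
    prior = sum(sq_dist(m, (0.0,) * len(m)) for m in mu)
    return fit + lam1 * (K - 1) + lam2 * local + lam3 * prior

# Hypothetical toy data: two groups of one-dimensional observations.
Y = [[(0.0,), (1.0,)], [(0.5,), (10.0,)]]
Z = [[0, 0], [0, 1]]
mu = [(0.5,), (10.0,)]
print(hdp_sva_cost(Y, Z, mu, lam1=2.0, lam2=3.0, lam3=0.01))
```

In this example the first group uses one cluster and the second uses two, so
only the second group contributes to the local penalty.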


4. SVA for the HDP-HMM 

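The HDP-HMM generative model with Gaussian observations is:

\begin{align}
\beta &\sim \mathrm{GEM}(\gamma) \tag{20} \\
\pi_k &\sim \mathrm{DP}(\alpha\beta), & k &\ge 1 \tag{21} \\
\mu_k &\sim \mathcal{N}(0, \sigma_0^2 I), & k &\ge 1 \tag{22} \\
z_t \mid z_{t-1} &\sim \mathrm{Multi}(\pi_{z_{t-1}}), & t &\ge 2 \tag{23} \\
y_t \mid z_t &\sim \mathcal{N}(\mu_{z_t}, \sigma^2 I), & t &\ge 1, \tag{24}
\end{align}

with $z_1 = 1$. Let $Z = (z_t)_{t=1}^{T}$, $Y = (y_t)_{t=1}^{T}$, and
$K = \max_{t=1,\dots,T} z_t$. The joint density of the model is

\begin{align}
&P(Z, Y, \mu_{1:K}, \pi_{1:K,1:K}, \beta_{1:K} \mid \sigma_0, \sigma, \alpha, \gamma) \notag \\
&\quad = P(Z, \beta_{1:K}, \pi_{1:K,1:K} \mid \alpha, \gamma)
  \prod_{t=1}^{T} \mathcal{N}(y_t \mid \mu_{z_t}, \sigma^2 I)
  \prod_{k=1}^{K} \mathcal{N}(\mu_k \mid 0, \sigma_0^2 I), \tag{25}
\end{align}

where

\begin{align}
&P(Z, \beta_{1:K}, \pi_{1:K,1:K} \mid \alpha, \gamma) \notag \\
&\quad = \mathrm{GEM}(\beta_{1:K}; \gamma)
  \prod_{k=1}^{K} \mathrm{DP}(\pi_{k,1:K}; \alpha\beta_{1:K})
  \prod_{t=2}^{T} \mathrm{Multi}(z_t; \pi_{z_{t-1},1:K}). \tag{26}
\end{align}

We consider two approaches to obtaining the SVA for the HDP-HMM. The first
is a “combinatorial” approach, in which we integrate out $\beta_{1:K}$ and
$\pi_{1:K,1:K}$. The second is a “direct” approach, in which we do not integrate
out $\beta_{1:K}$ and $\pi_{1:K,1:K}$.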


4.1. Combinatorial Approach. By integrating out $\beta_{1:K}$ and
$\pi_{1:K,1:K}$, we obtain the CRF representation, which is the same urn scheme
representation used in the original iHMM paper. The development is exactly the
same as in the HDP mixture model case; see Section 3. As before,
$\beta = \frac{1}{2\sigma^2}$, $\gamma = \exp(-\lambda_1\beta)$, and
$\alpha = \exp(-\lambda_2\beta)$, where $\lambda_1$ and $\lambda_2$ are free
parameters:

\[
\log P(t, Z \mid \alpha, \gamma)
= -\beta\lambda_1 (K - 1) - \beta\lambda_2 \sum_{i=1}^{K} (t_{i\cdot} - 1) + O(1).
\tag{27}
\]

If $s_i$ is the number of distinct transitions out of state $i$, then

\[
\max_{t \,:\, t \sim Z}
  \Big\{ -\lambda_1 (K - 1) - \lambda_2 \sum_{i=1}^{K} (t_{i\cdot} - 1) \Big\}
= -\lambda_1 (K - 1) - \lambda_2 \sum_{i=1}^{K} (s_i - 1).
\tag{28}
\]

If we use the free parameter $\lambda_3$ introduced in Section 3, then the SVA
optimization problem for the HDP-HMM is

\[
\min_{K, Z, \mu} \left\{
  \sum_{t=1}^{T} \|y_t - \mu_{z_t}\|^2
  + \lambda_1 (K - 1)
  + \lambda_2 \sum_{i=1}^{K} (s_i - 1)
  + \lambda_3 \sum_{i=1}^{K} \|\mu_i\|^2 \right\}.
\tag{29}
\]

In this cost function, the $\lambda_1$ term adds the cost for the total number
of states and the $\lambda_2$ term penalizes the total number of distinct
transitions out of the states. As in Eq. (19), the last term represents the
cost corresponding to the prior over the means of the states.
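The combinatorial HDP-HMM objective can be evaluated for a given state
sequence as in the following Python sketch. The function name
`hdp_hmm_sva_cost`, the zero-based state labels, and the toy sequence are
hypothetical. One edge case is handled by assumption: a state with no outgoing
transition (possible only for the state occupied at time $T$) contributes zero
to the transition penalty, since the objective above does not spell out that
case.

```python
def hdp_hmm_sva_cost(Y, Z, mu, lam1, lam2, lam3):
    """Evaluate the combinatorial SVA objective for the HDP-HMM.

    Y[t] is the observation at time t (a tuple of floats), Z[t] its state
    in range(K), and mu[k] the mean of state k.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    K = len(mu)
    fit = sum(sq_dist(y, mu[z]) for y, z in zip(Y, Z))
    # Collect the set of distinct successor states of each state.
    successors = [set() for _ in range(K)]
    for prev, cur in zip(Z, Z[1:]):
        successors[prev].add(cur)
    # s_i - 1 per state; states without outgoing transitions add nothing
    # (an assumption made for this sketch).
    trans = sum(len(s) - 1 for s in successors if s)
    prior = sum(sq_dist(m, (0.0,) * len(m)) for m in mu)
    return fit + lam1 * (K - 1) + lam2 * trans + lam3 * prior

# Hypothetical toy sequence with two states.
Y = [(0.0,), (0.1,), (5.0,), (5.2,), (0.0,)]
Z = [0, 0, 1, 1, 0]
mu = [(0.0,), (5.0,)]
print(hdp_hmm_sva_cost(Y, Z, mu, lam1=1.0, lam2=0.5, lam3=0.1))
```

Here both states have two distinct successors (themselves and the other
state), so the transition penalty is $\lambda_2 \cdot 2$.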


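4.2. Direct Approach. Alternatively, we can choose not to integrate out
$\beta_{1:K}$ and $\pi_{1:K,1:K}$. If we let $\beta_{K+1} = 1 - \beta_{\cdot K}$
and $\pi_{i,K+1} = 1 - \pi_{i,\cdot K}$, then

\begin{align}
&P(Z, \beta_{1:K+1}, \pi_{1:K,1:K+1} \mid \alpha, \gamma) \notag \\
&\quad = \prod_{i=1}^{K} \mathrm{Beta}\!\left(
    \frac{\beta_i}{1 - \beta_{\cdot (i-1)}} \,\Big|\, 1, \gamma \right)
  \mathrm{Dir}(\pi_{i,1:K+1} \mid \alpha\beta_{1:K+1})
  \prod_{t=2}^{T} \pi_{z_{t-1}, z_t} \tag{30} \\
&\quad = \prod_{i=1}^{K} \frac{\Gamma(1 + \gamma)}{\Gamma(\gamma)}
  \left( 1 - \frac{\beta_i}{1 - \beta_{\cdot (i-1)}} \right)^{\gamma - 1}
  \frac{\Gamma(\alpha)}{\prod_{j=1}^{K+1} \Gamma(\alpha\beta_j)}
  \prod_{j=1}^{K+1} \pi_{ij}^{\alpha\beta_j - 1}
  \prod_{t=2}^{T} \pi_{z_{t-1}, z_t}. \notag
\end{align}

Let $\gamma = \exp(-\lambda_1\beta)$ and $\alpha = \lambda_2\beta$, where the
unsubscripted $\beta = \frac{1}{2\sigma^2}$ is the scalar from the introduction
and $\beta_j$ denotes a stick-breaking weight. Taking the logarithm of the
product from $i = 1$ to $K$ yields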


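\begin{align}
&\sum_{i=1}^{K} \Big\{ -\log\Gamma(\gamma) + \alpha\log\alpha - \alpha + o(\beta) \notag \\
&\qquad\quad + \sum_{j=1}^{K+1} \big\{ -\alpha\beta_j \log(\alpha\beta_j)
   + \alpha\beta_j + \alpha\beta_j \log \pi_{ij} + o(\beta) \big\} \Big\} \tag{31} \\
&\quad = \beta \sum_{i=1}^{K} \Big\{ -\lambda_1
   + \sum_{j=1}^{K+1} \big\{ -\lambda_2 \beta_j \log \beta_j
   + \lambda_2 \beta_j \log \pi_{ij} \big\} \Big\} + o(\beta) \tag{32} \\
&\quad = -\beta \Big\{ \lambda_1 K
   + \lambda_2 \sum_{i=1}^{K} \mathrm{KL}(\beta_{1:K+1} \,\|\, \pi_{i,1:K+1}) \Big\}
   + o(\beta). \tag{33}
\end{align}

Here we have used the asymptotic expansions of $\log\Gamma(z)$:

\begin{align}
\log\Gamma(z) &= z \log z - z + o(z), & z &\to \infty \tag{34} \\
\log\Gamma(z) &= -\log z + O(z), & z &\downarrow 0. \tag{35}
\end{align}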
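Both expansions of $\log\Gamma(z)$ are easy to check numerically with the
standard-library function `math.lgamma`. In the Python sketch below, the
helper names are hypothetical: the first ratio should tend to zero as
$z \to \infty$ (remainder $o(z)$), and the second should stay bounded as
$z \downarrow 0$ (remainder $O(z)$).

```python
import math

def large_z_remainder(z):
    """(log Gamma(z) - (z log z - z)) / z, which tends to 0 as z grows."""
    return (math.lgamma(z) - (z * math.log(z) - z)) / z

def small_z_remainder(z):
    """(log Gamma(z) + log z) / z, which stays bounded as z shrinks."""
    return (math.lgamma(z) + math.log(z)) / z

for z in (1e2, 1e4, 1e6):
    print("large z:", z, large_z_remainder(z))

for z in (1e-2, 1e-4, 1e-6):
    print("small z:", z, small_z_remainder(z))
```

The small-$z$ ratio approaches minus the Euler-Mascheroni constant, which
follows from the standard expansion $\Gamma(z) = 1/z - \gamma_E + O(z)$ near
zero.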

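The resulting SVA optimization problem, which involves an additional
hyperparameter $\zeta$, is:

\begin{align}
\min_{K, Z, \mu, \beta, \pi} \Big\{
  &\sum_{t=1}^{T} \|y_t - \mu_{z_t}\|^2
  - \zeta \sum_{t=2}^{T} \log \pi_{z_{t-1}, z_t} \notag \\
  &+ \lambda_1 K
  + \lambda_2 \sum_{i=1}^{K} \mathrm{KL}(\beta_{1:K+1} \,\|\, \pi_{i,1:K+1})
  + \lambda_3 \sum_{i=1}^{K} \|\mu_i\|^2 \Big\}. \tag{36}
\end{align}

Compared to Eq. (29), the main difference is in the terms involving
hyperparameters $\zeta$ and $\lambda_2$. In this cost function, the $\zeta$ term
penalizes transition probabilities very close to zero. The KL divergence term
between $\beta_{1:K+1}$ and $\pi_{i,1:K+1}$ is due to the hierarchical structure
of the prior and it biases the transition probabilities $\pi_{i,1:K+1}$ to be
similar to the prior $\beta_{1:K+1}$.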
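The direct objective can likewise be evaluated for fixed parameters. The
Python sketch below assumes the objective consists of the squared-error term,
a $-\zeta \sum_t \log \pi_{z_{t-1}, z_t}$ transition term, the $\lambda_1 K$
state penalty, the KL terms, and the prior term on the means. The function
names, the zero-based state labels, and the toy inputs are hypothetical; the
weight vector `beta_w` and each row of `pi` have $K + 1$ entries, the last one
being the leftover mass.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length sequences."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hdp_hmm_direct_cost(Y, Z, mu, beta_w, pi, zeta, lam1, lam2, lam3):
    """Evaluate the direct SVA objective for the HDP-HMM."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    K = len(mu)
    fit = sum(sq_dist(y, mu[z]) for y, z in zip(Y, Z))
    # Transition term: large when a used transition has tiny probability.
    trans = -zeta * sum(math.log(pi[a][b]) for a, b in zip(Z, Z[1:]))
    # Hierarchical term: pulls each transition row towards beta_w.
    hier = lam2 * sum(kl_divergence(beta_w, pi[i]) for i in range(K))
    prior = lam3 * sum(sq_dist(m, (0.0,) * len(m)) for m in mu)
    return fit + trans + lam1 * K + hier + prior

# Hypothetical toy inputs with K = 2 states.
Y = [(0.0,), (0.0,), (1.0,)]
Z = [0, 0, 1]
mu = [(0.0,), (1.0,)]
beta_w = [0.5, 0.25, 0.25]
pi_same = [[0.5, 0.25, 0.25], [0.5, 0.25, 0.25]]   # rows equal to beta_w
pi_diff = [[0.8, 0.1, 0.1], [0.5, 0.25, 0.25]]     # first row differs

print(hdp_hmm_direct_cost(Y, Z, mu, beta_w, pi_same, 1.0, 1.0, 1.0, 0.5))
print(hdp_hmm_direct_cost(Y, Z, mu, beta_w, pi_diff, 0.0, 1.0, 1.0, 0.5))
```

When every transition row equals `beta_w`, the KL terms vanish; with
$\zeta = 0$, any deviation of a row from `beta_w` strictly increases the cost.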